Using Path and glob instead of walk_files #1069

krishnakalyan3 · 2020-12-05T17:33:41Z

I wanted to take an initial stab at this. I have made minor modifications to the following datasets:

yesno
librispeech
libritts
speechcommands

Looks like tedlium, ljspeech and vctk (VCTK_092) don't have walk_files

krishnakalyan3 · 2020-12-05T17:34:19Z

To test:

import torchaudio
yesno_data = torchaudio.datasets.YESNO('./', download=True)
yesno_data._walker

ls_data = torchaudio.datasets.LIBRISPEECH(".", download=True)
ls_data._walker

sc_data = torchaudio.datasets.SPEECHCOMMANDS(".", download=True)
sc_data._walker

krishnakalyan3 · 2020-12-05T17:36:38Z

yesno_data._walker

Before:
['0_0_0_0_1_1_1_1',
'0_0_0_1_0_0_0_1',
'0_0_0_1_0_1_1_0',
'0_0_1_0_0_0_1_0',
'0_0_1_0_0_1_1_0', ...]

After:
['0_0_0_0_1_1_1_1',
 '0_0_0_1_0_0_0_1',
 '0_0_0_1_0_1_1_0',
 '0_0_1_0_0_0_1_0',
 '0_0_1_0_0_1_1_0', ...]
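The before/after comparison above can also be checked without downloading a dataset. The following is a minimal sketch, assuming a flat directory of `.wav` files; `walk_stems` is a hypothetical stand-in for the old helper (it is not the torchaudio `walk_files` implementation), and the file names are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def walk_stems(root, suffix):
    # Hypothetical stand-in for the old helper: recurse with os.walk,
    # keep files ending in `suffix`, and strip the suffix.
    stems = []
    for _, _, files in os.walk(root):
        for name in files:
            if name.endswith(suffix):
                stems.append(name[: -len(suffix)])
    return sorted(stems)


def glob_stems(root, suffix):
    # The approach taken in this PR: a non-recursive glob on a known layout.
    return sorted(p.stem for p in Path(root).glob("*" + suffix))


with tempfile.TemporaryDirectory() as root:
    for name in ["0_0_0_1_0_0_0_1", "0_0_0_0_1_1_1_1", "0_0_1_0_0_0_1_0"]:
        Path(root, name + ".wav").touch()
    Path(root, "README.txt").touch()  # ignored by both functions
    before = walk_stems(root, ".wav")
    after = glob_stems(root, ".wav")

print(before == after)
```

Both functions return the sorted file stems, so the walker contents should be unchanged by the switch as long as the audio files sit directly under the dataset folder.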

vincentqb · 2020-12-07T15:26:09Z

torchaudio/datasets/librispeech.py

-        walker = walk_files(
-            self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
-        )
+        walker = sorted([str(p.stem) for p in Path(self._path).glob('*/*/*'+self._ext_audio)])


nit: missing whitespace around arithmetic operator :)
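To illustrate what the `'*/*/*' + self._ext_audio` pattern selects, here is a small sketch on a temporary directory. The three-level layout, the `.flac` extension, and the file names are assumptions for illustration only; they are not taken from the PR.

```python
import tempfile
from pathlib import Path

ext_audio = ".flac"  # assumed extension, for illustration

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: <root>/<speaker>/<chapter>/<utterance>.flac
    deep = Path(root, "84", "121123")
    deep.mkdir(parents=True)
    (deep / "84-121123-0000.flac").touch()
    (deep / "84-121123-0001.flac").touch()
    (deep / "84-121123.trans.txt").touch()  # wrong extension, not matched
    Path(root, "stray.flac").touch()         # wrong depth, not matched

    # Each '*' matches exactly one path component, so only files
    # three levels below the root are returned.
    walker = sorted(p.stem for p in Path(root).glob("*/*/*" + ext_audio))

print(walker)
```

Because the pattern encodes the expected depth, files elsewhere in the tree are never visited, unlike a recursive walk.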

vincentqb · 2020-12-07T15:26:50Z

torchaudio/datasets/yesno.py

-        walker = walk_files(
-            self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
-        )
+        walker = sorted([str(p.stem) for p in Path(self._path).glob('*.wav')])


nit: + self._ext_audio instead of .wav

vincentqb · 2020-12-07T15:28:39Z

Looks like tedlium does not contain walk_files/filter, and I am assuming that ljspeech should also be added to the list?

ljspeech also doesn't use walk_files, see here.

vincentqb · 2020-12-07T15:29:06Z

Thanks for looking into this! I gave minor comments, but LGTM so far :)

krishnakalyan3 · 2020-12-12T16:39:35Z

@vincentqb looks like vctk does not have walk_files.

mthrok

Mostly looks good. Could you also remove the definition of walk_files?

mthrok · 2020-12-14T20:46:44Z

torchaudio/datasets/librispeech.py

-        walker = walk_files(
-            self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
-        )
+        walker = sorted([str(p.stem) for p in Path(self._path).glob('*/*/*' + self._ext_audio)])
        self._walker = list(walker)


Could you get rid of list(walker) as it is redundant?

I would do

self._walker = [str(p.stem) for p in Path(self._path).glob(f'*/*/*{self._ext_audio}')]
self._walker.sort()

Good point, there's no need to make a copy of the list. It seems like using iterator+sorted is similar? Is this a fair comparison?

In [1]: %timeit sorted(range(1000, 0, -1))
14.2 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [2]: %timeit sorted(list(range(1000, 0, -1)))
16.4 µs ± 370 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: def func():
   ...:     a = list(range(1000, 0, -1))
   ...:     a.sort()
   ...:

In [4]: %timeit func()
15.4 µs ± 172 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
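For anyone who wants to repeat this comparison outside IPython, the following is a rough sketch using the standard-library `timeit` module. The function names are made up for illustration, and the timings will vary by machine and Python version, so no expected numbers are shown.

```python
import timeit


def sorted_iterable():
    # sorted() consumes the iterable directly
    return sorted(range(1000, 0, -1))


def sorted_list_copy():
    # builds a list first, then sorted() builds a second one
    return sorted(list(range(1000, 0, -1)))


def list_then_sort():
    # builds one list and sorts it in place
    a = list(range(1000, 0, -1))
    a.sort()
    return a


runs = 2000
for fn in (sorted_iterable, sorted_list_copy, list_then_sort):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e6:.2f} us per call")
```

All three variants produce the same sorted list; the differences are only in how many intermediate lists get built.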

I've updated the PR with sorted(generator), since it turns out that sorted(generator) creates the list and then updates it in place anyway. Once the tests pass, I'll merge this PR as it is. Thanks @krishnakalyan3 for the quick follow-up :)

List comprehension is not a generator, please revert it.

ah, okay, you changed the list comprehension to a generator.

Next time, please consult with me before doing so.

I cannot find a reference that says sorted is in-place. Can you give the reference for that?

See the implementation of sorted. Regardless, the test above suggests that implementation is faster.
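The observable behaviour behind this exchange can be sketched in a few lines. From the caller's point of view `sorted` is not in place: it accepts any iterable and returns a new list, while `list.sort` mutates the list it is called on. (As far as I understand the CPython source, `sorted` builds a fresh list from its argument and then sorts that new list in place internally, which is probably what the comment above refers to.)

```python
source = [3, 1, 2]

result = sorted(source)                 # returns a new list
gen_result = sorted(x for x in source)  # a generator works as input too

in_place = list(source)
returned = in_place.sort()              # sorts in place and returns None

print(result, source, returned)
```

The caller's original list is left untouched by `sorted`, so passing a generator simply avoids building an extra intermediate list.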

vincentqb · 2020-12-15T15:59:25Z

Looks like tedlium, ljspeech and vctk (VCTK_092) don't have walk_files

Fixes: #1051

VCTK does have walk_files here. For simplicity, let's update VCTK and remove walk_files as a separate PR. Thanks again @krishnakalyan3 for the pull request :)

codecov · 2020-12-15T16:04:13Z

Codecov Report

Merging #1069 (6eddeb2) into master (3691b8e) will not change coverage.
The diff coverage is n/a.


vincentqb

Thanks again for the work! We can update VCTK and remove walk_files as follow-ups. I'll go ahead and merge this once the tests pass.

vincentqb · 2020-12-15T19:24:21Z

LGTM! Merging. Thanks again @krishnakalyan3!

krishnakalyan3 · 2020-12-15T19:53:07Z

Thanks for all the awesome feedback, gentlemen.

The VCTK dataset, which uses walk_files, is deprecated in favor of VCTK_092. At the moment, even if I try to use VCTK, it forces me to use the VCTK_092 dataset. Hence, I was not sure what to do.

mthrok · 2020-12-15T20:16:54Z

@krishnakalyan3

It was not explained in the original issue, but the fundamental reason for the changes you contributed is to remove walk_files. Using walk_files makes it ambiguous who is responsible for locating the files (the Dataset class or the utility?). In fact, glob-ing everything is not the right approach when implementing a Dataset: if you have a specific dataset in mind, its directory structure and file locations are already determined, so there is no need for potentially unbounded recursion. Each Dataset implementation should glob exactly the set of files it requires.

Unfortunately, the original VCTK dataset is no longer publicly available and I do not have a copy, so I do not know the expected directory structure. We can still move the glob logic into the VCTK dataset so that the dataset itself is responsible for locating the files. Would you like to work on it? If you do not have time, I will file a separate issue for that.
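The design point above, that each Dataset should locate its own files from a known layout, can be sketched as follows. `ToyDataset`, its two-level directory layout, and the file names are hypothetical and chosen only for illustration; this is not the torchaudio VCTK implementation, whose original directory structure is unknown here.

```python
import tempfile
from pathlib import Path


class ToyDataset:
    """Hypothetical dataset that locates its own files (not torchaudio API)."""

    _ext_audio = ".wav"

    def __init__(self, root):
        self._path = Path(root)
        # Assumed layout: <root>/<speaker>/<utterance>.wav
        # The pattern states the expected depth, so no recursion is needed.
        pattern = "*/*" + self._ext_audio
        self._walker = sorted(p.stem for p in self._path.glob(pattern))

    def __len__(self):
        return len(self._walker)

    def __getitem__(self, n):
        return self._walker[n]


with tempfile.TemporaryDirectory() as root:
    files = [("spk1", "spk1_002"), ("spk1", "spk1_001"), ("spk2", "spk2_001")]
    for speaker, utterance in files:
        folder = Path(root, speaker)
        folder.mkdir(exist_ok=True)
        (folder / (utterance + ".wav")).touch()
    dataset = ToyDataset(root)
    items = [dataset[i] for i in range(len(dataset))]

print(items)
```

With the glob living inside the class, the knowledge of where files live stays next to the code that loads them, and a shared recursive helper is no longer needed.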

krishnakalyan3 · 2020-12-15T20:27:56Z

@mthrok thanks for the explanation. I will work on the PR.


facebook-github-bot added the CLA Signed label Dec 5, 2020

vincentqb reviewed Dec 7, 2020

krishnakalyan3 requested a review from vincentqb December 12, 2020 23:10

mthrok reviewed Dec 14, 2020

vincentqb mentioned this pull request Dec 14, 2020

Use glob in datasets instead of walk_files #1051

Closed

7 tasks

krishnakalyan3 requested a review from mthrok December 15, 2020 13:33

krishnakalyan3 and others added 17 commits December 15, 2020 11:03

3391106 initial changes for yesno dataset
79968d6 librispeech to glob + path
0ba43ac minor changes
4bc29cb minor comment
c97918e sc update
f44d985 change to Path libritts
9081235 change to Path libritts
5aab320 yes no update
41d742f remove redundant list
f8fbbcb remove redundant list
bbed79c add missing comma
f70914b remove walk_files
f162d64 walker sort
a6632a5 fix sort
9525171 inplace sorting
bf3b267 inplace sort
ce06026 use sorted(generator)

vincentqb force-pushed the glob_dataset branch from 68938b6 to ce06026 Compare December 15, 2020 16:04

vincentqb approved these changes Dec 15, 2020

vincentqb merged commit d25a4dd into pytorch:master Dec 15, 2020

krishnakalyan3 mentioned this pull request Dec 17, 2020

VCTK Using Path and glob instead of walk_files #1101

Merged
